Introduction

The objective of this project was to compile data on which data science skills are currently in demand. We did this by building multiple tables that combine skill names scraped from blog posts with result counts from actual job postings.

In doing this, we sought to answer three questions:

  1. How do the skills vary between the different job posting sites?
  2. How do the skills sought in job postings compare to the skills recommended in articles and blogs?
  3. Is there any difference among the three categories: Languages, Fundamentals, and Soft Skills?

Step 1: Listing skills based on job postings

Our first main step was creating a table populated from 1) skills in demand according to data blogs and 2) job postings listing those skills.

Performing Web Scraping to fetch the skills using a node selector

We used the rvest package to use a node selector and subset the skills of interest:

# Load rvest, then start by reading an HTML page with read_html():
library(rvest)

skills_page <- read_html("https://brainstation.io/career-guides/data-science-resume-examples#what-skills-should-you-put-on-a-data-science-resume")
skills_set1 <- skills_page %>% html_nodes("ul:nth-child(4)") %>% html_elements("li") %>% html_text2()
skills_set2 <- skills_page %>% html_nodes("ul:nth-child(6)") %>% html_elements("li") %>% html_text2()
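
The same selector-and-extract pattern can be tried without a network connection, since read_html() also parses a literal HTML string. A minimal sketch (toy markup, not the BrainStation page):

```r
library(rvest)

# read_html() accepts literal HTML, so we can exercise the
# node-selector workflow on a tiny in-memory page:
page <- read_html("<ul><li> Python </li><li> SQL </li></ul>")

# Select every list item and extract whitespace-trimmed text:
skills <- page %>% html_elements("li") %>% html_text2()
skills  # "Python" "SQL"
```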

From here, we can preview the vector:

writeLines(skills_set1)
## Data Courses
## Data Analytics
## Data Science
## Python
## Cybersecurity
## Data analysis
## Data wrangling
## Data modeling
## Statistics
## Data visualization
## Programming
## Quantitative analysis
## Machine learning
## Machine learning models
## Data mining
## Debugging
## Hypothesis testing
## A/B tests
## Regression
## Research the company, the role, and relevant data skills
## Reference resume templates and samples to build a resume outline
## Add relevant education experience, work experience and data projects to the correct section of your resume
## Highlight experience with machine learning and data tools
## Craft concise bullet points using the action verb + task + result format for each experience, emphasizing data-driven successes
## Have a trusted peer proofread your Data Scientist resume for grammar and spelling to make sure your experience is professionally presented
## Work on projects in a collaborative setting
## Take advantage of our flexible plans and scholarships
## Get access to VIP events and workshops

The next step we took was subsetting the relevant entries and trimming extra spaces from the words in skills_set1 and skills_set2.

vector1 = c()
for (i in 6:19) {
  vector1[i-5] <- trimws(skills_set1[i], which = "both")
}

vector2 = c()
for (i in 3:34) {
  vector2[i-2] <- trimws(skills_set2[i], which = "both")
}
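
Since trimws() is vectorized over character vectors, the same subset-then-trim step can also be written without explicit loops. A minimal sketch on toy strings:

```r
# trimws() operates element-wise, so subsetting and cleaning is a one-liner:
skills_raw <- c("Header item", " Data analysis ", "Data wrangling  ", "  Statistics")
cleaned <- trimws(skills_raw[2:4], which = "both")
cleaned  # "Data analysis" "Data wrangling" "Statistics"
```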

Here are the cleaned vectors:

vector1 %>% writeLines()
## Data analysis
## Data wrangling
## Data modeling
## Statistics
## Data visualization
## Programming
## Quantitative analysis
## Machine learning
## Machine learning models
## Data mining
## Debugging
## Hypothesis testing
## A/B tests
## Regression
vector2 %>% writeLines()
## R
## Python
## C
## C++
## C#
## HTML
## Java
## JavaScript
## PHP
## SAS
## SQL
## Scala
## MATLAB
## SQL Server
## NoSQL
## Hadoop
## OpenRefine
## TensorFlow
## Cloudera
## Tableau
## Microsoft Excel
## Octave
## Spark
## PowerBI
## Plotly
## Bokeh
## Matplotlib
## Seaborn
## Keras
## Pytorch
## AWS
## Hive

Soft Skills: We can start by reading an HTML page with read_html():

softskills_page <- read_html("https://zety.com/blog/data-scientist-resume-example")

From here, we can scrape the skills using a node selector:

skills_set3<- softskills_page %>% html_nodes("td:nth-child(1) > p > a") %>% html_text2()

skills_set3[3:9]
## [1] "Soft Skills"          "Communication"        "Collaboration"       
## [4] "Creative Thinking"    "Critical Thinking"    "Problem Solving"     
## [7] "Interpersonal Skills"

As before, we now have a vector of soft skills:

vector3 = c()
for (i in 3:9) {
  vector3[i-2] <- trimws(skills_set3[i], which = "both")
}

vector3
## [1] "Soft Skills"          "Communication"        "Collaboration"       
## [4] "Creative Thinking"    "Critical Thinking"    "Problem Solving"     
## [7] "Interpersonal Skills"

We can now combine the technical-skill vectors with the soft-skill vector.

title_vector <- c(vector1, vector2, vector3)

Adding the values in the columns

From there, we manually searched multiple job listing sites and populated tables based on the number of results:

tbl1.siteskills = data.frame(skill_title = title_vector,
                indeed = indeed_jobs,
                linkedin = linkedin_jobs,
                simplyhired = simplyhired_jobs,
                ziprecruiter = ziprecruiter_jobs
                )
head(tbl1.siteskills) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
skill_title indeed linkedin simplyhired ziprecruiter
Data analysis 46,915 580,822 27,586 1,685,359
Data wrangling 2,660 3,329 1,233 434
Data modeling 28,328 135,443 13,702 623,648
Statistics 2,327 216,875 44,456 288,970
Data visualization 41,727 119,686 37,409 170,268
Programming 9,444 510,656 87,187 2,047,037

We'll refer to this as Table 1 moving forward. We can export this table into a working directory:

# Get working directory path
path <- getwd()
path
## [1] "/Users/deepikadilip/git/Data_607_Project3"
write.csv(tbl1.siteskills, file.path(path, "skills_cleaned_data.csv"))
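
One small caveat (an aside, not part of the original pipeline): by default, write.csv() also writes R's row names as an extra, unnamed first column. Passing row.names = FALSE keeps the CSV limited to the table's own columns:

```r
# Toy frame standing in for tbl1.siteskills:
df <- data.frame(skill_title = c("R", "SQL"), indeed = c(10, 20))

tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)  # no leading row-name column

header <- readLines(tmp)[1]
header  # "\"skill_title\",\"indeed\""
```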

Step 2: Integration with Existing Tables

From here, we can join this table with existing data we've scraped:

Table 2: Skill Categories

tbl2.skillcat = read.csv('SkillCategories.csv')
head(tbl2.skillcat) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
skill_title SKILLID skill_category
Data analysis 1 Fundamentals
Data wrangling 2 Fundamentals
Data modeling 3 Fundamentals
Statistics 4 Mathematics
Data visualization 5 Fundamentals
Programming 6 Fundamentals

Table 4: Skill Recommendation Websites

tbl4.skillsites = read.csv('RecommendationSites.csv')
head(tbl4.skillsites, 2) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
RecommendationID URL Site Title Year
1 https://www.thebalancecareers.com/list-of-data-scientist-skills-2062381 the balance careers Important Job Skills for Data Scientist 2021
2 https://www.analyticsvidhya.com/blog/2020/11/14-must-have-skills-to-become-a-data-scientist-with-resources/ Analytics Vidhya 14 Must-Have Skills to Become a Data Scientist (with Resources!) 2020
In total, we have the following tables:

  • Table 1: skill_title, indeed, linkedin, simplyhired, ziprecruiter, SKILLID
  • Table 2: skill_title, SKILLID, skill_category
  • Table 3: Skill, RecommendationID, InJobSkills
  • Table 4: RecommendationID, URL, Site, Title, Year

Step 3: Answering Questions

Job Demand Distribution

We can start visualization by looking at the job demand distribution in Table 1. To do this, we'll need to melt the dataset prior to using ggplot:

mlt.tbl1 = pivot_longer(tbl1.siteskills, cols = c("indeed", "linkedin", "simplyhired", "ziprecruiter"), values_to = "count")
mlt.tbl1 = mlt.tbl1 %>% mutate(count = as.integer(gsub("\\,", "", count)))
mlt.tbl1.merged = mlt.tbl1 %>% merge(tbl2.skillcat, by = c("skill_title", "SKILLID"), all.x = T)
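
The comma-stripping step matters because the scraped counts arrive as strings like "46,915"; gsub() removes the thousands separators so as.integer() can parse them (toy values for illustration):

```r
# Result counts scraped from the job sites are formatted strings:
counts_raw <- c("46,915", "580,822", "1,685,359")

# Drop the commas, then convert; counts above ~2.1 billion would
# need as.numeric() instead of as.integer():
counts <- as.integer(gsub(",", "", counts_raw))
counts  # 46915 580822 1685359
```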

mlt.tbl1.merged$name = str_to_sentence(mlt.tbl1.merged$name)

plt.sites = ggplot(mlt.tbl1.merged, aes(skill_title, count, fill = factor(skill_category))) + geom_bar(stat = "identity") + facet_wrap(~name, ncol = 1, scales = "free_y") + theme_minimal() + scale_fill_economist() + labs(x = "Skill", y = "Count", fill = "Skill Category") + theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 1), legend.position = "bottom") 

ggplotly(plt.sites) #Makes it interactive

This plot is a bit detailed, so we can visualize this distribution in more granular "buckets":

mlt.tbl1.merged$skill_bucket = ifelse(mlt.tbl1.merged$skill_category == "Languages", "Languages", ifelse(mlt.tbl1.merged$skill_category == "Soft Skills", "Soft Skills", "Tech Data"))
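
Applied to a toy vector of categories, the nested ifelse() above collapses everything that is neither a language nor a soft skill into the "Tech Data" bucket:

```r
categories <- c("Languages", "Soft Skills", "Fundamentals", "Mathematics")
bucket <- ifelse(categories == "Languages", "Languages",
                 ifelse(categories == "Soft Skills", "Soft Skills", "Tech Data"))
bucket  # "Languages" "Soft Skills" "Tech Data" "Tech Data"
```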

plt.sites.aggregated = ggplot(mlt.tbl1.merged, aes(skill_title, count, fill = factor(skill_category))) + geom_bar(stat = "identity") + facet_wrap(~ skill_bucket + name, ncol = 2, scales = "free") + theme_minimal() + scale_fill_economist() + labs(x = "Skill", y = "Count", fill = "Skill Category") + theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 1), legend.position = "bottom") 

ggplotly(plt.sites.aggregated, width = 800, height = 1500)

If we don't want to stratify by job website and visualize an overview instead, it will appear as the following:

tbl1.per_bucket = mlt.tbl1.merged %>% group_by(skill_title, skill_bucket) %>% summarise(count = sum(count))

plt.bucket = ggplot(tbl1.per_bucket, aes(skill_title, count, fill = factor(skill_bucket))) + geom_bar(stat = "identity") + facet_wrap(~ skill_bucket, ncol = 1, scales = "free") + theme_minimal() + scale_fill_economist() + labs(x = "Skill", y = "Count", fill = "Skill Category") + theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 1), legend.position = "bottom") 

ggplotly(plt.bucket, width = 800, height = 1500)

As an extra step, we can run some basic statistical tests across the board to see if any particular skills are listed more often on one website than the others. We'll start by running a proportions test and seeing which skills differ the most across websites:
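
For context, a bare-bones version of such a test in base R looks like the following (toy counts, not our scraped data): prop.test() compares the proportion of listings mentioning a skill across the four sites, and its chi-squared statistic is what we sort by.

```r
# Listings mentioning a given skill on each site (made-up numbers):
mentions <- c(300, 520, 210, 480)
# Total listings sampled per site (made-up numbers):
totals <- c(5000, 8000, 4000, 9000)

# Chi-squared test of equal proportions across the four sites:
res <- prop.test(mentions, totals)
res$statistic  # the test statistic used to rank skills
res$p.value    # p-value for the null of equal proportions
```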

prop.table %>% arrange(desc(test.statistic)) %>% head(5) %>% kable() %>% kable_styling(bootstrap_options = "basic")
indeed linkedin simplyhired ziprecruiter In.Recommended.Skills. test.statistic
Communication 0.0060592 0.4471799 0.0040841 0.5426765 4.0e-07 18815965
Hypothesis testing 0.0016622 0.0015483 0.0010033 0.9957858 4.0e-07 11030788
Programming 0.0035579 0.1923851 0.0328469 0.7712029 7.2e-06 7180952
Data analysis 0.0200433 0.2481416 0.0117854 0.7200272 2.6e-06 5567239
Problem Solving 0.3125529 0.5271250 0.1508633 0.0094539 5.0e-06 4506362

We'll do the same for those that differ the least among websites:

prop.table  %>%  arrange(desc(test.statistic)) %>% tail(5)  %>% kable() %>% kable_styling(bootstrap_options = "basic") 
indeed linkedin simplyhired ziprecruiter In.Recommended.Skills. test.statistic
Plotly 0.1740139 0.2114683 0.1143520 0.4998343 0.0003315 2600.47
Octave 0.1479140 0.7926675 0.0417193 0.0164349 0.0012642 2235.63
Seaborn 0.2855805 0.3511236 0.2050562 0.1573034 0.0009363 478.18
Bokeh 0.2816667 0.3000000 0.1600000 0.2566667 0.0016667 228.06
OpenRefine 0.3523810 0.1809524 0.3238095 0.1333333 0.0095238 52.26

Key observations:

  • Soft skills were quite consistent throughout job listings. Communication in particular was emphasized across the board, followed by problem solving.
  • With respect to languages, job descriptions appeared to emphasize querying languages (such as SQL) over programming languages such as Python and R. However, when excluding SQL, both Python and JavaScript appeared in a considerable number of job listings.
  • Software packages were different across the board. Seaborn and plotly had similar distributions across the job sites, but Matplotlib was seen more often in ZipRecruiter.

Differences by Category

First, we look at the skills that appear both in the Job Postings and in the articles/blogs with recommended skills:

color = c("#008080", "#428bca", "#d2afff", "#dab600", "#979aaa")

mlt.tbl1.merged %>% 
#change the blanks to NA 
  mutate(In.Recommended.Skills.=
           ifelse(In.Recommended.Skills.=="", 
                  NA, In.Recommended.Skills.)) %>%
  mutate(skill_category = 
           ifelse(skill_category =="Software" |
                    skill_category =="Database Management" |
                    skill_category == "Fundamentals" |
                    skill_category == "Mathematics", 
                  "Fundamentals", skill_category)) %>%
#filter out NAs
#now have job posting skills that were also in recommendations
  filter(!is.na(In.Recommended.Skills.)) %>%
  treemap (index=c("skill_category","skill_title"),
           vSize = "count",
           type="index",
               fontsize.labels=c(15,12),                
    fontcolor.labels=c("white","blue"),
    fontface.labels=c(2,1),                 
    bg.labels=c("transparent"),              
    align.labels=list(
        c("center", "top"), 
        c("right", "bottom")),                                
    overlap.labels=0.5,                   
    inflate.labels=F,
  border.col=c("black","white"),  
    border.lwds=c(7,2),
  palette = color,
  title="Skills by Category in Both", 
    fontsize.title=10
  )
## Warning in if (class(try(col2rgb(bg.labels), silent = TRUE)) == "try-error")
## stop("Invalid bg.labels"): the condition has length > 1 and only the first
## element will be used

Next, we look at the skills that are in the Job Postings BUT NOT in the articles/blogs with recommended skills:

color2 = c("#428bca","#d2afff","#dab600","#979aaa","#008080")

#which category has skills in job postings and in recomm
mlt.tbl1.merged %>% 
#change the blanks to NA 
  mutate(In.Recommended.Skills.=
           ifelse(In.Recommended.Skills.=="", 
                  NA, In.Recommended.Skills.)) %>%
  mutate(skill_category = 
           ifelse(skill_category =="Software" |
                    skill_category =="Database Management" |
                    skill_category == "Fundamentals" |
                    skill_category == "Mathematics" |
                    skill_category =="Data Warehouse" |
                    skill_category =="Visualization",
                  "Fundamentals", skill_category)) %>%
  mutate(skill_category = 
           ifelse(skill_category =="Python Packages",
                  "Languages", skill_category)) %>%
#filter to keep only NAs
#now have job posting skills that were NOT in recommendations
  filter(is.na(In.Recommended.Skills.)) %>%
  treemap (index=c("skill_category","skill_title"),
           vSize = "count",
           type="index",
               fontsize.labels=c(15,12),                
    fontcolor.labels=c("white","blue"),
    fontface.labels=c(2,1),                 
    bg.labels=c("transparent"),              
    align.labels=list(
        c("center", "top"), 
        c("right", "bottom")),                                
    overlap.labels=0.5,                   
    inflate.labels=F,
  border.col=c("black","white"),  
    border.lwds=c(7,2),
  palette = color2,
  title="Skills by Category in Posting only", 
    fontsize.title=10
  )
## Warning in if (class(try(col2rgb(bg.labels), silent = TRUE)) == "try-error")
## stop("Invalid bg.labels"): the condition has length > 1 and only the first
## element will be used

What the Data Sci Skill Articles Got Right (and didn't)

There was a good amount of overlap in the skills for Data Science recommended by articles and the job postings:

The largest group in this case is the Soft Skills, specifically, Communication, Problem Solving, Collaboration and Critical Thinking. After that came Fundamentals, with skills like Programming, Data Analysis and Excel (the top 3 in this category). The smallest group here was Languages, with SQL, Python, Java and C++ as the top 4.

When we look at the skills the articles MISSED recommending, we see that Fundamentals is just slightly bigger than Soft Skills, and Languages is just a bit smaller. Top skills by category that were MISSED by the articles are:

  • Fundamentals: Hypothesis Testing
  • Soft Skills: Interpersonal Skills
  • Languages: JavaScript, HTML, C# and C
